CRUXEval-output

p-values

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they disagree; ties are discarded. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. Hover over each entry to display the information used to compute the p-value.
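
As a minimal sketch, the p-value for one pair of models can be computed with a two-sided binomial (sign) test on the examples where the two models disagree. The per-example correctness vectors below are hypothetical stand-ins, not data from this page.

```python
from scipy.stats import binomtest

def sign_test_pvalue(correct_a, correct_b):
    """Two-sided p-value under the null that each model wins a
    discordant example with probability 1/2; ties are discarded."""
    wins_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    wins_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = wins_a + wins_b  # only examples where the two models differ
    if n == 0:
        return 1.0  # the models never disagree: no evidence either way
    return binomtest(wins_a, n, p=0.5, alternative="two-sided").pvalue

# Hypothetical example: A alone is correct on 20 examples, B alone on 8,
# and both are correct on 5 more.
a = [True] * 20 + [False] * 8 + [True] * 5
b = [False] * 20 + [True] * 8 + [True] * 5
print(sign_test_pvalue(a, b))  # ~0.036
```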

Typical delta needed for good p-values

We can also look at the typical p-value obtained for a given difference in accuracy. Hover over a point to display the model pair it corresponds to.
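
A rough sketch of how such points could be computed, reusing `sign_test_pvalue` from the sketch above; the `correct` dictionary of per-example results is a hypothetical stand-in for the actual CRUXEval outputs.

```python
from itertools import combinations

# Hypothetical per-example correctness; real CRUXEval results would go here.
correct = {
    "model_a": [True, True, False, True, False, True],
    "model_b": [True, False, False, True, True, False],
    "model_c": [False, False, False, True, False, False],
}

# One point per model pair: accuracy gap vs. sign-test p-value.
points = []
for m1, m2 in combinations(correct, 2):
    acc1 = sum(correct[m1]) / len(correct[m1])
    acc2 = sum(correct[m2]) / len(correct[m2])
    delta = abs(acc1 - acc2)
    points.append((delta, sign_test_pvalue(correct[m1], correct[m2]), (m1, m2)))

for delta, p, pair in sorted(points):
    print(f"{pair}: delta={delta:.3f}, p={p:.3f}")
```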

Pairwise wins (including ties)

Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
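
A minimal sketch of these counts for one model pair, assuming (this is not stated explicitly on this page) that the two tie types are "both correct" and "both incorrect":

```python
def head_to_head(correct_a, correct_b):
    """Win/tie counts for one model pair, Chatbot-Arena style."""
    counts = {"a_wins": 0, "b_wins": 0, "tie_both_correct": 0, "tie_both_wrong": 0}
    for a, b in zip(correct_a, correct_b):
        if a and not b:
            counts["a_wins"] += 1
        elif b and not a:
            counts["b_wins"] += 1
        elif a and b:
            counts["tie_both_correct"] += 1
        else:
            counts["tie_both_wrong"] += 1
    return counts

print(head_to_head([True, True, False, False], [True, False, True, False]))
# {'a_wins': 1, 'b_wins': 1, 'tie_both_correct': 1, 'tie_both_wrong': 1}
```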

Result table

We show three methods currently used for evaluating code models: the raw accuracy (pass@1) used by benchmarks, the average win rate against all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena). These usually have near-perfect correlation.
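
As a sketch of the third method, Bradley-Terry coefficients can be fitted with a logistic regression over the pairwise outcomes, in the style of the Chatbot Arena notebook, and then rescaled to an Elo-like range. The 400-point scale and the 1000 base (the table anchors gpt-3.5-turbo-0613 at 1000) are assumptions here, as are the model names in the toy usage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_elo(battles, models, scale=400, base=1000):
    """Fit Bradley-Terry coefficients from (model_a, model_b, winner) battles
    (winner is "a" or "b"; ties are dropped) and rescale to Elo-like scores."""
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for a, b, winner in battles:
        row = np.zeros(len(models))
        row[idx[a]], row[idx[b]] = np.log(10), -np.log(10)
        X.append(row)
        y.append(1 if winner == "a" else 0)
    lr = LogisticRegression(fit_intercept=False)
    lr.fit(np.array(X), np.array(y))
    return {m: base + scale * lr.coef_[0][idx[m]] for m in models}

# Toy usage: model_x beats model_y in 7 of their 10 decisive comparisons.
battles = [("model_x", "model_y", "a")] * 7 + [("model_x", "model_y", "b")] * 3
print(bradley_terry_elo(battles, ["model_x", "model_y"]))
```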

rank model pass@1 win_rate elo
0 gpt-4-turbo-2024-04-09+cot 0.820 0.936 1544.180
1 gpt-4-0613+cot 0.771 0.928 1519.130
2 claude-3-opus-20240229+cot 0.820 0.876 1404.876
3 gpt-3.5-turbo-0613+cot 0.590 0.823 1323.343
4 gpt-4-0613 0.687 0.762 1235.863
5 codellama-34b+cot 0.436 0.752 1225.902
6 gpt-4-turbo-2024-04-09 0.677 0.744 1215.286
7 claude-3-opus-20240229 0.657 0.697 1163.230
8 codellama-13b+cot 0.360 0.694 1162.080
9 codellama-7b+cot 0.299 0.601 1070.951
10 deepseek-base-33b 0.486 0.547 1019.993
11 deepseek-instruct-33b 0.499 0.546 1019.080
12 gpt-3.5-turbo-0613 0.494 0.523 1000.000
13 codetulu-2-34b 0.458 0.511 987.470
14 deepseek-base-6.7b 0.435 0.497 976.538
15 magicoder-ds-7b 0.444 0.479 960.910
16 codellama-34b 0.424 0.461 945.742
17 mixtral-8x7b 0.405 0.460 946.291
18 codellama-13b 0.397 0.439 927.468
19 wizard-34b 0.434 0.414 906.376
20 wizard-13b 0.413 0.411 904.437
21 codellama-python-34b 0.414 0.402 897.732
22 codellama-python-13b 0.398 0.398 893.673
23 deepseek-instruct-6.7b 0.412 0.373 873.682
24 phind 0.397 0.362 864.398
25 phi-2 0.335 0.346 849.806
26 codellama-python-7b 0.359 0.344 848.583
27 mistral-7b 0.343 0.329 833.922
28 codellama-7b 0.342 0.327 833.731
29 starcoderbase-16b 0.342 0.323 828.742
30 deepseek-base-1.3b 0.310 0.284 794.441
31 starcoderbase-7b 0.322 0.271 782.240
32 phi-1.5 0.275 0.257 768.074
33 deepseek-instruct-1.3b 0.287 0.229 740.721
34 phi-1 0.217 0.154 654.672